This lab provides a fairly brief introduction to using INLA for Bayesian analysis in R. The first section will cover using INLA for non-spatial models (LMs, GLMS and mixed-effects models),. We’ll then go on to look at using INLA for spatial regression models. In addition to INLA, R has a large number of add-on packages for Bayesian analysis listed here. Some notable ones are:
Before starting the lab, you will need to set up a new folder for your working directory. Go to your geog6000 folder now and create a new folder for today’s class called lab11.
You’ll need the following files:
You will need to download these files from Canvas, and move them from your Downloads folder to the datafiles folder that you made previously.
Now start RStudio and change the working directory to lab11. As a reminder, you can do this by going to the [Session] menu in RStudio, then [Change working directory]. This will open a file browser that you can use to browse through your computer and find the folder.
The INLA package is not available through the usual CRAN repositories. Instead, you’ll need to install this manually. The following code will install the current stable version of INLA:
install.packages("INLA",repos=c(getOption("repos"),INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
Once this is installed, we can load the libraries that we’ll need for this lab
library(dplyr)
library(ggplot2)
library(lme4)
library(sf)
library(INLA)
library(spdep)
library(tmap)
We’ll start by using INLA to build a simple linear regression model of life expectancy (lifeExp) as a function of per capita GDP (gpdPercap) from the GapMinder data. Start by a) loading the data; b) centering the years on 1952; and c) log transforming GDP (to check why we do this last step, try making a histogram of the untransformed and transformed GDP values).
gap <- read.csv("../datafiles/gapminderData5.csv")
gap$year2 <- gap$year - 1952
gap$gdpPercap2 <- log(gap$gdpPercap)
Next, make an OLS regression model using lm() as a comparison. With the increasing complexity of Bayesian models, it is quite useful to make a traditional model as a verification that your results are correct. In the summary(), take note of the coefficients, and the residual standard error (this is an estimate of how much residual variance you have)
fit_lm <- lm(lifeExp ~ gdpPercap2, gap)
summary(fit_lm)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap2, data = gap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.778 -4.204 1.212 4.658 19.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1009 1.2277 -7.413 1.93e-13 ***
## gdpPercap2 8.4051 0.1488 56.500 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared: 0.6522, Adjusted R-squared: 0.652
## F-statistic: 3192 on 1 and 1702 DF, p-value: < 2.2e-16
Now let’s set up our INLA model. For this simple model, it’s pretty straightforward if we just use the default priors:
inla_lm <- inla(lifeExp ~ gdpPercap2,
data = gap)
By default, the INLA model will not calculate any model diagnostics (to make it as efficient as possible), so we can add the following arguments to get a) WAIC and DIC scores and b) model estimates of life expectancy for each observation:
inla_lm <- inla(lifeExp ~ gdpPercap2,
data = gap,
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE))
If you want to see some of the detail of INLA working, add verbose=TRUE to the inla() function. Now let’s look at the output
summary(inla_lm)
##
## Call:
## c("inla(formula = lifeExp ~ gdpPercap2, data = gap, control.compute =
## list(dic = TRUE, ", " waic = TRUE), control.predictor = list(compute =
## TRUE))" )
## Time used:
## Pre = 4.79, Running = 0.549, Post = 0.53, Total = 5.87
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) -9.099 1.228 -11.510 -9.099 -6.691 -9.099 0
## gdpPercap2 8.405 0.149 8.113 8.405 8.697 8.405 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.017 0.001 0.016 0.017
## 0.975quant mode
## Precision for the Gaussian observations 0.018 0.017
##
## Expected number of effective parameters(stdev): 2.00(0.00)
## Number of equivalent replicates : 851.93
##
## Deviance Information Criterion (DIC) ...............: 11760.48
## Deviance Information Criterion (DIC, saturated) ....: 1707.78
## Effective number of parameters .....................: 3.03
##
## Watanabe-Akaike information criterion (WAIC) ...: 11760.97
## Effective number of parameters .................: 3.51
##
## Marginal log-Likelihood: -5899.79
## Posterior marginals for the linear predictor and
## the fitted values are computed
And there’s quite a lot. Note there is a table of fixed effects with the intercept and slope estimates, the WAIC and DIC, and the precision value. To show only the fixed effects,
inla_lm$summary.fixed
## mean sd 0.025quant 0.5quant 0.975quant mode
## (Intercept) -9.099393 1.2277173 -11.510106 -9.099428 -6.690708 -9.099395
## gdpPercap2 8.404902 0.1487672 8.112785 8.404897 8.696771 8.404902
## kld
## (Intercept) 1.500902e-09
## gdpPercap2 1.502466e-09
Note that as this is a Bayesian model, we don’t get a standard error and \(p\)-value, instead we get a description of the posterior distribution of each coefficient, so the summary gives the mean, median, mode, s.d. and the 2.5 and 97.5 percentiles (0.025quant and 0.975quant). As a quick rule of thumb, if these percentiles have the same sign, we can assume that there is a credible effect (note the difference in language here). You can get a quick plot of these posterior distributions with INLA’s plot() function. By default, this will try to plot everything from the model, so it is easier to only select the fixed effects for now:
plot(inla_lm, plot.fixed.effects = TRUE,
plot.random.effects = FALSE,
plot.hyperparameters = FALSE,
plot.predictor = FALSE)
In the linear model, we got an estimate of the residual standard deviation of around 7.6. This is a measure of the size of the errors from the model. In the INLA output, we get an estimate of the residual precision instead, which is the inverse of the variance. You can extract this with:
inla_lm$summary.hyperpar
## mean sd 0.025quant
## Precision for the Gaussian observations 0.01724183 0.0005814291 0.01611705
## 0.5quant 0.975quant mode
## Precision for the Gaussian observations 0.01723434 0.01840503 0.01721826
To compare, we can invert this and take the square root to convert back to standard deviation units, and to check that the models are estimating the same value:
sqrt(1 / 0.0172)
## [1] 7.624929
Which matches the estimate from the lm() function. And we can plot the posterior distribution:
plot(inla_lm, plot.fixed.effects = FALSE,
plot.random.effects = FALSE,
plot.hyperparameters = TRUE,
plot.predictor = FALSE)
As INLA calculated the Bayesian posterior estimates for each observation, we can compare these to the observed life expectancy. The posterior estimates are in:
inla_lm$summary.linear.predictor
We’ll just plot the mean of these predictions against the observations:
plot(gap$lifeExp, inla_lm$summary.linear.predictor$mean)
abline(0,1)
In the next example, we’ll build a generalized linear model (GLM) using the Irish education dataset. In this dataset, we are interested in modeling the probability of a student taking the leaving certificate (lvcert) using a verbal reasoning score (DVRT). Load the data and center the DVRT scores:
irished <- read.csv("../datafiles/irished.csv")
irished$DVRT.cen <- irished$DVRT - mean(irished$DVRT)
Build a standard glm for comparison:
fit_glm <- glm(lvcert ~ DVRT.cen,
data = irished,
family = binomial(link = 'logit'))
summary(fit_glm)
##
## Call:
## glm(formula = lvcert ~ DVRT.cen, family = binomial(link = "logit"),
## data = irished)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9077 -0.9810 -0.4543 1.0307 2.1552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.278422 0.099665 -2.794 0.00521 **
## DVRT.cen 0.064369 0.007576 8.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 686.86 on 499 degrees of freedom
## Residual deviance: 593.77 on 498 degrees of freedom
## AIC: 597.77
##
## Number of Fisher Scoring iterations: 3
Again, note the coefficient estimates. Now let’s build the same model with INLA. Here, we simply specify the family argument (the default is gaussian), and use the same arguments as before to estimate model diagnostics:
inla_glm <- inla(lvcert ~ DVRT.cen,
data = irished,
family = "binomial",
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE))
summary(inla_glm)
##
## Call:
## c("inla(formula = lvcert ~ DVRT.cen, family = \"binomial\", data =
## irished, ", " control.compute = list(dic = TRUE, waic = TRUE),
## control.predictor = list(compute = TRUE))" )
## Time used:
## Pre = 5.54, Running = 0.323, Post = 0.383, Total = 6.24
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) -0.279 0.100 -0.476 -0.279 -0.084 -0.278 0
## DVRT.cen 0.065 0.008 0.050 0.064 0.080 0.064 0
##
## Expected number of effective parameters(stdev): 2.00(0.00)
## Number of equivalent replicates : 249.92
##
## Deviance Information Criterion (DIC) ...............: 597.75
## Deviance Information Criterion (DIC, saturated) ....: 597.75
## Effective number of parameters .....................: 1.99
##
## Watanabe-Akaike information criterion (WAIC) ...: 597.79
## Effective number of parameters .................: 2.02
##
## Marginal log-Likelihood: -306.61
## Posterior marginals for the linear predictor and
## the fitted values are computed
As before, we can plot out the posterior distributions of the intercept and slope. How do these compare to the glm output?
plot(inla_glm, plot.fixed.effects = TRUE,
plot.random.effects = FALSE,
plot.hyperparameters = FALSE,
plot.predictor = FALSE)
Note that there is no separate estimate of residual variance in a binomial model.
Next, we’ll estimate a mixed-effects model for the Gap Minder data, with a random intercept. First, we’ll build a comparison model using lmer() from the lme4 package. We’ll again model the trend of life expectancy over time within the dataset as a linear fixed effect:
fit_lmer <- lmer(lifeExp ~ year2 + (1 | country), gap)
summary(fit_lmer)
## Linear mixed model fit by REML ['lmerMod']
## Formula: lifeExp ~ year2 + (1 | country)
## Data: gap
##
## REML criterion at convergence: 9866.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -6.1692 -0.4910 0.1216 0.6030 2.4194
##
## Random effects:
## Groups Name Variance Std.Dev.
## country (Intercept) 123.15 11.097
## Residual 12.85 3.584
## Number of obs: 1704, groups: country, 142
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 50.51208 0.94547 53.43
## year2 0.32590 0.00503 64.79
##
## Correlation of Fixed Effects:
## (Intr)
## year2 -0.146
In INLA, additional random effects can be added with the f() function. These include a large number of effects, including groups (as here) and spatial and temporal effects (see below). The format is f(ID, model), where ID is an index in the data set identifying groups or other structures and model is the type of random effect. Here, we’ll use the iid model with assumes that model errors are i.i.d. within countries.
inla_lmer <- inla(lifeExp ~ year2 + f(country, model = "iid"),
data = gap,
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE))
summary(inla_lmer)
##
## Call:
## "inla(formula = lifeExp ~ year2 + f(country, model = \"iid\"), data =
## gap)"
## Time used:
## Pre = 6.04, Running = 0.558, Post = 0.343, Total = 6.94
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 50.512 0.947 48.651 50.512 52.372 50.512 0
## year2 0.326 0.005 0.316 0.326 0.336 0.326 0
##
## Random effects:
## Name Model
## country IID model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.078 0.003 0.072 0.078
## Precision for country 0.008 0.001 0.006 0.008
## 0.975quant mode
## Precision for the Gaussian observations 0.083 0.078
## Precision for country 0.010 0.008
##
## Expected number of effective parameters(stdev): 141.77(0.154)
## Number of equivalent replicates : 12.02
##
## Marginal log-Likelihood: -4968.37
Check that the fixed effects are similar to those from lmer(). Again, you can plot these:
plot(inla_lmer, plot.fixed.effects = TRUE,
plot.random.effects = FALSE,
plot.hyperparameters = FALSE,
plot.predictor = FALSE)
If we plot the hyperparameters, we now get the posterior distribution for the residual variance and the variance explained by the random intercepts.
plot(inla_lmer, plot.fixed.effects = FALSE,
plot.random.effects = FALSE,
plot.hyperparameters = TRUE,
plot.predictor = FALSE,
plot.prior = TRUE)
We can compare these by extracting these values and back converting to variance. First, the lmer() effects
VarCorr(fit_lmer)
## Groups Name Std.Dev.
## country (Intercept) 11.0972
## Residual 3.5842
inla_lmer$summary.hyperpar
## mean sd 0.025quant
## Precision for the Gaussian observations 0.078002830 0.002800661 0.072475583
## Precision for country 0.008289739 0.001034976 0.006270206
## 0.5quant 0.975quant mode
## Precision for the Gaussian observations 0.078025292 0.08347545 0.078173381
## Precision for country 0.008302706 0.01030871 0.008409998
# Residual s.d.
sqrt(1 / 0.078)
## [1] 3.580574
# Intercept s.d.
sqrt(1 / 0.0082)
## [1] 11.04315
We can get the random effect for any country (the offset from the grand mean intercept). These are in inla_lmer$summary.random. Each row has the country name (this was the index used in the f() function) as well as the summary of the posterior distribution for that country. To find the random effect for Japan:
rowid <- which(inla_lmer$summary.random$country == "Japan")
inla_lmer$summary.random$country[rowid, ]
## ID mean sd 0.025quant 0.5quant 0.975quant mode kld
## 67 Japan 15.21874 1.386968 12.49579 15.21834 17.94135 15.21765 1.318092e-08
Indicating that Japan’s life expectancy is about 15 years above the global average over the period Gap observations, and the 2.5 and 97.5 percentile indicate that this is credible higher than average. We can also plot the marginal distribution of this estimate (this is the distribution), by extracting the relevant values from the inla_lmer$marginals.random:
japan_re <- inla_lmer$marginals.random$country[[rowid]]
plot(japan_re, type = 'l', xlab = "LifeExp", ylab = "p")
We’ll now turn to building spatial models with INLA. In the nuforc.zip file there is a shapefile with the counts of UFO sightings for US states between 1990 and 2019. Let’s load this now:
ufos <- st_read("../datafiles/nuforc/ufo_sf_annual.shp")
## Reading layer `ufo_sf_annual' from data source
## `/Users/u0784726/Dropbox/Data/devtools/geog6000/datafiles/nuforc/ufo_sf_annual.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1421 features and 44 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.71 ymin: 24.54235 xmax: -66.98702 ymax: 49.36967
## Geodetic CRS: WGS 84
This contains quite a lot of information, but our models, we’ll need the count data (count) and the population size. We’ll subset out only 2014 using dplyr (you can also do this with ufos_2014 <- subset(ufos, year == 2014)):
ufos_2014 <- ufos %>%
filter(year == 2014)
And now use tmap to map out the counts and population size:
p1 <- tm_shape(ufos_2014) + tm_fill("count", style = "jenks") +
tm_borders() +
tm_layout(main.title = "UFO sightings 2014")
p2 <- tm_shape(ufos_2014) + tm_fill("population", style = "jenks") +
tm_borders() +
tm_layout(main.title = "State population 2014")
tmap_arrange(p1, p2)
We’ll start with a simple model of the count of sightings as a function of population size. We’ll use this as a demonstration, even though the outcome are counts, and a Poisson model would be a better approach (we’ll get to this shortly).
ufo_lm <- lm(count ~ population, ufos_2014)
summary(ufo_lm)
##
## Call:
## lm(formula = count ~ population, data = ufos_2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -261.79 -30.40 -12.68 22.04 219.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.452e+01 1.480e+01 2.332 0.024 *
## population 1.988e-03 1.543e-04 12.882 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.66 on 47 degrees of freedom
## Multiple R-squared: 0.7793, Adjusted R-squared: 0.7746
## F-statistic: 165.9 on 1 and 47 DF, p-value: < 2.2e-16
And now we’ll build the same using INLA:
ufo_inla_lm <- inla(count ~ population,
data = ufos_2014,
control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
control.predictor = list(compute = TRUE))
Let’s check the coefficients against the OLS model:
ufo_inla_lm$summary.fixed
## mean sd 0.025quant 0.5quant 0.975quant
## (Intercept) 34.524391174 1.485288e+01 5.251553181 34.523952272 63.756196425
## population 0.001987856 1.548172e-04 0.001682735 0.001987852 0.002292551
## mode kld
## (Intercept) 34.524398804 2.327035e-06
## population 0.001987856 2.321022e-06
We’ll now plot the posterior distribution of the coefficients and the estimate of the residual variance. Rather than using INLA’s plot() function, we’ll get the values directly from the model output and plot them. The posterior distribution of the fixed effects can be obtained from:
ufo_inla_lm$marginals.fixed
And the residual variance from:
ufo_inla_lm$marginals.hyperpar
plot_df <- as.data.frame(inla.tmarginal(function(x) x,
ufo_inla_lm$marginals.fixed$`(Intercept)`))
ggplot(plot_df, aes(x = x, y = y)) +
geom_line(size = 2) +
scale_x_continuous("Intercept") +
scale_y_continuous("P") +
theme_bw()
plot_df <- as.data.frame(inla.tmarginal(function(x) x,
ufo_inla_lm$marginals.fixed$population))
ggplot(plot_df, aes(x = x, y = y)) +
geom_line(size = 2) +
scale_x_continuous("Population") +
scale_y_continuous("P") +
theme_bw()
plot_df <- as.data.frame(inla.tmarginal(function(x) 1/x,
ufo_inla_lm$marginals.hyperpar$`Precision for the Gaussian observations`))
ggplot(plot_df, aes(x = x, y = y)) +
geom_line(size = 2) +
scale_x_continuous("Residual variance") +
scale_y_continuous("P") +
theme_bw()
Just as an example, the following code adds informed priors on the slope and intercept coefficient, by increasing the precision of the prior (effectively reducing the variance or range of values). The mean of the priors is left at zero. Try running this and compare the estimates of coefficients and the WAIC/DIC goodness of fit values.
ufo_inla_lm2 <- inla(count ~ population,
data = ufos_2014,
control.fixed = list(mean = 0, prec = 0.2,
mean.intercept = 0, prec.intercept = 0.2),
control.family=list(hyper=list(prec = list(prior = 'loggamma',
param = c(0.1, 0.1)))),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE))
summary(ufo_inla_lm2)
##
## Call:
## c("inla(formula = count ~ population, data = ufos_2014, control.compute
## = list(dic = TRUE, ", " waic = TRUE), control.predictor = list(compute
## = TRUE), control.family = list(hyper = list(prec = list(prior =
## \"loggamma\", ", " param = c(0.1, 0.1)))), control.fixed = list(mean =
## 0, prec = 0.2, ", " mean.intercept = 0, prec.intercept = 0.2))")
## Time used:
## Pre = 5.05, Running = 0.276, Post = 0.33, Total = 5.66
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 0.698 2.218 -3.657 0.699 5.049 0.699 0
## population 0.002 0.000 0.002 0.002 0.002 0.002 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.00 0.00 0.00 0.00
## 0.975quant mode
## Precision for the Gaussian observations 0.00 0.00
##
## Expected number of effective parameters(stdev): 1.02(0.004)
## Number of equivalent replicates : 48.02
##
## Deviance Information Criterion (DIC) ...............: 571.83
## Deviance Information Criterion (DIC, saturated) ....: 55.34
## Effective number of parameters .....................: 2.18
##
## Watanabe-Akaike information criterion (WAIC) ...: 577.38
## Effective number of parameters .................: 6.47
##
## Marginal log-Likelihood: -298.00
## Posterior marginals for the linear predictor and
## the fitted values are computed
We’ll use a conditional autoregressive (CAR) model to account for spatial dependency in the UFO data. This is a very similar approach to the spatial error model we looked at in a previous lab. In INLA, this means adding a new random effect to the model using the f() function. There’s a few things we need to do to set this up:
ufos_2014$idarea <- 1:nrow(ufos_2014)
poly2nb() function from spdep)ufos_nb <- poly2nb(ufos_2014)
ufos_W <- nb2mat(ufos_nb)
Quick plot of the spatial adjacencies:
plot(st_geometry(ufos_2014), reset = FALSE)
plot(ufos_nb, st_coordinates(st_centroid(ufos_2014)), add = TRUE, col = 2, lwd = 2)
With all that in place, we can build a spatial model. We’ll use the newer bym2 formulation of the CAR model. Note that the format is similar to specifying a random effect above: we need the index (idarea), the model (bym2) and the spatial weight matrix:
ufo_inla_bym <- inla(count ~ population + f(idarea, model = "bym2", graph = ufos_W),
data = ufos_2014,
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE)
)
Now let’s look at the output:
summary(ufo_inla_bym)
##
## Call:
## c("inla(formula = count ~ population + f(idarea, model = \"bym2\", ", "
## graph = ufos_W), data = ufos_2014, control.compute = list(dic = TRUE,
## ", " waic = TRUE), control.predictor = list(compute = TRUE))" )
## Time used:
## Pre = 24.7, Running = 0.906, Post = 0.378, Total = 25.9
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 34.525 14.753 5.487 34.524 63.531 34.525 0
## population 0.002 0.000 0.002 0.002 0.002 0.002 0
##
## Random effects:
## Name Model
## idarea BYM2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.00e+00 0.00e+00 0.000 0.000
## Precision for idarea 9.13e+05 3.53e+06 1.131 1079.850
## Phi for idarea 3.26e-01 1.03e-01 0.151 0.317
## 0.975quant mode
## Precision for the Gaussian observations 0.00e+00 0.000
## Precision for idarea 1.17e+07 0.602
## Phi for idarea 5.48e-01 0.296
##
## Expected number of effective parameters(stdev): 2.00(0.002)
## Number of equivalent replicates : 24.48
##
## Deviance Information Criterion (DIC) ...............: 567.49
## Deviance Information Criterion (DIC, saturated) ....: 52.88
## Effective number of parameters .....................: 2.60
##
## Watanabe-Akaike information criterion (WAIC) ...: 572.02
## Effective number of parameters .................: 6.08
##
## Marginal log-Likelihood: -281.16
## Posterior marginals for the linear predictor and
## the fitted values are computed
The main change to the output is that we now have three hyperparameter values:
Precision for the Gaussian observations: this is the precision on the non-spatial residualsPrecision for idarea: this is the precision on the spatial fieldPhi for idarea: this is the mixing parameter, which decomposes the precision for idarea into a spatial and non-spatial effect.Note that the precision for the Gaussian term is very high (the variance is low). This is likely not very well constrained, so we’ll try to add a prior on the spatial random effect to improve this. Here we specify a penalized complexity prior - note that we need one for the precision term and one for phi. The format for these priors is given by \((U, \alpha)\), where \(\alpha\) is the probability of the hyperparameter \(\xi\) exceeding \(U\): (\(P(T(\xi) > U) = \alpha\))
prior_bym2 <- list(
prec = list(
prior = "pc.prec",
param = c(1, 0.01)),
phi = list(
prior = "pc",
param = c(0.5, 2 / 3))
)
And let’s rebuild the model:
ufo_inla_bym2 <- inla(count ~ population + f(idarea, model = "bym2",
graph = ufos_W,
hyper = prior_bym2),
data = ufos_2014,
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE),
control.inla = list(strategy = "adaptive",int.strategy = "eb")
)
ufo_inla_bym2$summary.hyperpar
## mean sd 0.025quant
## Precision for the Gaussian observations 1.784842e-04 3.889182e-05 1.133510e-04
## Precision for idarea 1.127060e+04 9.992121e+03 3.562608e+03
## Phi for idarea 6.477860e-01 9.705357e-02 4.676081e-01
## 0.5quant 0.975quant mode
## Precision for the Gaussian observations 1.747749e-04 2.651107e-04 1.677363e-04
## Precision for idarea 8.170725e+03 3.740793e+04 5.017224e+03
## Phi for idarea 6.444866e-01 8.371989e-01 6.214203e-01
These look a little better (or at least more consistent). From the results we can map out the spatial effect. This is the random effect in space as modeled by the bym2 model. Summary values are held in ufo_inla_bym2$summary.random$idarea. Note that only the first 49 values here represent the spatial random field:
ufos_2014$u <- ufo_inla_bym2$summary.random$idarea$mean[1:49]
tm_shape(ufos_2014) + tm_fill("u", style = "jenks") +
tm_borders() +
tm_layout(main.title = "UFO model spatial random effect")
We can also extract the Bayesian estimates of the number of sightings from summary.linear.predictor:
ufos_2014$yhat <- ufo_inla_bym2$summary.linear.predictor$mean
tm_shape(ufos_2014) + tm_fill("yhat", style = "jenks") +
tm_borders() +
tm_layout(main.title = "State population 2014")
We noted above that as our outcome variable is the count of sightings, a Poisson model would be a better choice. We can convert out model to use a Poisson distribution by adding the argument family = "poisson" (similar the the binomial model we looked at earlier). However, for these data, we may want to model them as rates, rather than as counts. A reasonable expectation for the count of something is that it will depend on the size of the underlying population. In this example, the number of sightings is strongly correlated with the population size of the each state (this is basically what we modeled in the previous section). Modeling rates means that we can investigate if the rate of sightings on a per capita basis is relatively higher or lower than the expected number of counts, if all states were equal. This gives us a model of relative rates. In a classic Poisson model, we model the log of counts (\(Y_i\)) as a function of covariates:
\[ log(Y_i) \sim \beta X_i + \epsilon \]
In a relative rate model, the observed count (\(Y_i\)) for location \(i\) is defined as the expected count for \(i\) multiplied by the relative rate (\(\theta_i\)):
\[ Y_i \sim Pois(E_i \theta_i) \]
And then we model the log of the rate term:
\[ log(\theta_i) = \beta X_i + \epsilon \]
To model this with INLA, we first need to estimate the expected value (\(E_i\)), the number of sightings in each state if all states have the same rate. We’ll do this in a two step process. First, calculate the rate (\(r_s\)) in the standard population. This is given by the total number of sightings in the data, divided by the total population:
rs <- sum(ufos_2014$count) / sum(ufos_2014$population)
rs
## [1] 0.002522805
This gives the per capita rate of UFO sightings for the US (about 1 in 400). Now to get \(E_i\) for each state, multiply the state population by this number:
ufos_2014$Ei <- ufos_2014$population * rs
We’ll also redefine the priors for the spatial model:
prior_bym2 <- list(
prec = list(
prior = "pc.prec",
param = c(0.6 / 0.31, 0.01)),
phi = list(
prior = "pc",
param = c(0.005, 1 / 10))
)
With this information, we can now build our model. We’ll start by simply modeling the counts with an intercept term. We specify the family as poisson and add the argument E to define the expected counts. We again include the bym2 model for the spatial effects. The final argument (control.inla) speeds up the calculation of the hyperparameters at the cost of some accuracy. You can ignore this.
ufo_inla_rr <- inla(count ~ 1 + f(idarea, model = "bym2",
graph = ufos_W,
hyper = prior_bym2),
data = ufos_2014, family = "poisson", E = Ei,
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE),
control.inla = list(strategy = "adaptive",int.strategy = "eb")
)
summary(ufo_inla_rr)
##
## Call:
## c("inla(formula = count ~ 1 + f(idarea, model = \"bym2\", graph =
## ufos_W, ", " hyper = prior_bym2), family = \"poisson\", data =
## ufos_2014, ", " E = Ei, control.compute = list(dic = TRUE, waic =
## TRUE), ", " control.predictor = list(compute = TRUE), control.inla =
## list(strategy = \"adaptive\", ", " int.strategy = \"eb\"))")
## Time used:
## Pre = 23.4, Running = 0.353, Post = 0.365, Total = 24.1
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 0.033 0.047 -0.06 0.033 0.126 0.033 0
##
## Random effects:
## Name Model
## idarea BYM2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Precision for idarea 4.759 1.30 2.660 4.610 7.738 4.328
## Phi for idarea 0.568 0.24 0.115 0.584 0.949 0.771
##
## Expected number of effective parameters(stdev): 45.04(0.00)
## Number of equivalent replicates : 1.09
##
## Deviance Information Criterion (DIC) ...............: 420.40
## Deviance Information Criterion (DIC, saturated) ....: 106.46
## Effective number of parameters .....................: 45.18
##
## Watanabe-Akaike information criterion (WAIC) ...: 414.14
## Effective number of parameters .................: 28.14
##
## Marginal log-Likelihood: -236.32
## Posterior marginals for the linear predictor and
## the fitted values are computed
As before, we can extract and plot the spatial random field
ufos_2014$u <- ufo_inla_rr$summary.random$idarea$mean[1:49]
tm_shape(ufos_2014) + tm_fill("u", style = "jenks") +
tm_borders() +
tm_layout(main.title = "UFO model spatial random effect (RR model)")
In addition, we can plot the estimated relative rates for each state from the summary of the fitted values:
ufos_2014$RR <- ufo_inla_rr$summary.fitted.values[, "mean"]
tm_shape(ufos_2014) + tm_fill("RR", palette = "-inferno") +
tm_borders() +
tm_layout(main.title = "UFO sightings (relative rates)")
Lastly, we can estimate excedence probabilities. This is the probability that a given location has a relative rate that exceeds a given threshold. To estimate this, we need the marginal posterior distribution of relative rates for each state. To illustrate how this works, we’ll first extract the posterior distribution for Oregon.
rowid <- which(ufos_2014$name == "Oregon")
mpd <- ufo_inla_rr$marginals.fitted.values[[rowid]]
plot(mpd, type = 'l', xlab = "RR", ylab = "p")
abline(v = 2, lty = 2)
To get the probability of exceeding a relative rate of 2, we need to calculate the integral to the right hand side of the vertical line in the figure above. INLA comes with a function (
inla.pmarginal()) that will do this for us. As this calculates the integral to the left hand side, we subtract the result from 1 to get the excedence probability
1 - inla.pmarginal(q = 2, marginal = mpd)
## [1] 0.9746807
Giving a 0.97 probability that sightings in Oregon exceed 2x the national average. Note that this function can also be used to estimate the credibility of coefficients being higher or lower than some threshold. We’ll now use R’s sapply() function to calculate this for all states, and map the results:
ufos_2014$exc <- sapply(ufo_inla_rr$marginals.fitted.values,
function(marg){1 - inla.pmarginal(q = 2, marginal = marg)})
tm_shape(ufos_2014) + tm_fill("exc", palette = "Blues") +
tm_borders() +
tm_layout(main.title = "Excedence probability (RR > 2)")
For a final model, we’ll add a covariate to our relative rate model, to see if there is a credible link to the rates of sightings. Here, we’ll load a data set with the number of airports per state, to test if the number of sightings is linked to this (and by extension the number of flights). Load this and log transform the number of airports.
airports <- read.csv("../datafiles/nuforc/airports.csv")
airports$lairports <- log(airports$airports)
Now, we’ll merge this with the 2014 UFO dataframe:
ufos_2014 <- merge( ufos_2014, airports, by.x = "postal", by.y = "state")
ggplot(ufos_2014, aes(x = airports, y = count)) +
scale_x_log10("Airports") + scale_y_log10("Sightings") +
geom_point()
ufo_inla_rr2 <- inla(count ~ lairports +
f(idarea, model = "bym2",
graph = ufos_W,
hyper = prior_bym2),
data = ufos_2014, family = "poisson", E = Ei,
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(compute = TRUE),
control.inla = list(strategy = "adaptive",int.strategy = "eb")
)
summary(ufo_inla_rr2)
##
## Call:
## c("inla(formula = count ~ lairports + f(idarea, model = \"bym2\", ", "
## graph = ufos_W, hyper = prior_bym2), family = \"poisson\", ", " data =
## ufos_2014, E = Ei, control.compute = list(dic = TRUE, ", " waic =
## TRUE), control.predictor = list(compute = TRUE), ", " control.inla =
## list(strategy = \"adaptive\", int.strategy = \"eb\"))" )
## Time used:
## Pre = 24.1, Running = 0.353, Post = 0.361, Total = 24.8
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 0.419 0.482 -0.529 0.420 1.364 0.421 0
## lairports -0.068 0.084 -0.233 -0.068 0.097 -0.068 0
##
## Random effects:
## Name Model
## idarea BYM2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Precision for idarea 4.915 1.212 2.915 4.792 7.638 4.56
## Phi for idarea 0.142 0.144 0.005 0.093 0.542 0.01
##
## Expected number of effective parameters(stdev): 45.72(0.00)
## Number of equivalent replicates : 1.07
##
## Deviance Information Criterion (DIC) ...............: 420.61
## Deviance Information Criterion (DIC, saturated) ....: 106.67
## Effective number of parameters .....................: 45.87
##
## Watanabe-Akaike information criterion (WAIC) ...: 414.30
## Effective number of parameters .................: 28.51
##
## Marginal log-Likelihood: -242.86
## Posterior marginals for the linear predictor and
## the fitted values are computed
And we’ll finish by plotting the posterior distribution of the coefficient for the log number of airports:
plot_df <- as.data.frame(ufo_inla_rr2$marginals.fixed$lairports)
ggplot(plot_df, aes(x = x, y = y)) +
geom_line(size = 2) +
scale_x_continuous("log(Airports)") +
scale_y_continuous("p") +
theme_bw()
# Files used in lab
| Column header | Variable |
|---|---|
| sex | Sex of student (male = 0; female = 1) |
| DVRT | Vocal reasoning test score |
| fathocc | Prestige score of fathers occupation |
| lvcert | Taken leaving certificate (yes = 1; no = 0) |
| schltype | School type |
| Column header | Variable |
|---|---|
| country | Country |
| year | Year |
| pop | Population |
| continent | Continent/Region |
| lifeExp | Life expectancy (yrs) |
| gdpPercap | Per Capita GDP (USD 2007) |